11/19/2020

Agenda

  • Introduction
  • K-means
  • Hierarchical clustering
  • Principal Component Analysis (PCA)
  • Principal Component Regression (PCR)

Recap

Unsupervised learning

We have seen many models for \(y \mid x\). This is called supervised learning: we explain the relationship between \(x\) and a chosen response variable \(y\).

Now we will see models for \(x\) alone.

  • This generally means that the analysis goals are not clearly defined and we cannot directly validate the findings (there is no response variable, so no analogue of RSS)
  • However, there are still many important questions that we can consider:
    • Finding lower-dimensional representations
    • Finding clusters / groups

Principal Components Analysis (PCA)

Main goal

  • Principal component analysis (PCA) produces a low-dimensional representation of a dataset
  • Basic idea: Find a low-dimensional representation that approximates the data as closely as possible in Euclidean distance


  • It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated
  • It can produce variables for use in supervised learning problems (PCA regression), and it can also serve as a tool for data visualization

Example

[Figures illustrating PCA on a two-variable example dataset; not reproduced here]

PCA: details

The first principal component (PC1) of a set of features \(X_1,\dots, X_p\) is the normalized linear combination of the features \[Z_1 = \varphi_{11} X_1 + \varphi_{21} X_2 +\dots + \varphi_{p1} X_p\] that has the largest variance. By normalized, we mean that \(\sum_{j=1}^{p} \varphi_{j1}^2 = 1\)


In this example:

  • \(Z_{1} = \frac{\sqrt{2}}{2} X_{1} + \frac{\sqrt{2}}{2} X_{2}\)
  • \(Z_{2} = - \frac{\sqrt{2}}{2} X_{1} + \frac{\sqrt{2}}{2} X_{2}\)

PCA: loadings

  • We refer to the elements \(\varphi_{11}, \dots, \varphi_{p1}\) as the loadings of the first principal component; together, the loadings make up the principal component loading vector, \(\boldsymbol{\varphi}_1 = (\varphi_{11}, \dots, \varphi_{p1})^\intercal\)
  • The loading vector \(\boldsymbol{\varphi}_1\) defines a direction in feature space along which the data vary the most


In this example:

  • \(\boldsymbol{\varphi}_{1} = \left( \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)^{\intercal}\)
  • \(\boldsymbol{\varphi}_{2} = \left( - \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)^{\intercal}\)
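As a minimal sketch of how these loading vectors arise (assuming NumPy and a small made-up data matrix whose two columns have equal variance, so the loadings come out exactly as above, up to an arbitrary sign flip):

```python
import numpy as np

# Made-up data whose two columns have equal variance and positive correlation,
# so the loading vectors are exactly (sqrt(2)/2, sqrt(2)/2) and (-sqrt(2)/2, sqrt(2)/2)
# up to sign.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])

Xc = X - X.mean(axis=0)               # center each column
S = np.cov(Xc, rowvar=False)          # 2 x 2 sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
Phi = eigvecs[:, np.argsort(eigvals)[::-1]]  # columns phi_1, phi_2, sorted by variance

print(Phi[:, 0])  # first loading vector:  +/- [0.707, 0.707]
print(Phi[:, 1])  # second loading vector: +/- [-0.707, 0.707]
```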

PCA: details

  • If we project the \(n\) data points \(\mathbf{x}_1,\dots, \mathbf{x}_n\) onto this direction, the projected values are the first principal component scores \(z_{11},\dots, z_{n1}\)
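Continuing the sketch above (the centered matrix `Xc` and loading matrix `Phi` are reused from it), the scores are just the coordinates of each observation along the loading vectors:

```python
# Project the centered observations onto the loading vectors:
# column m of Z holds the PC-m scores z_{1m}, ..., z_{nm}.
Z = Xc @ Phi
print(Z[:, 0])  # PC1 scores of the four observations
```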

PCA: details

  • The second principal component is the linear combination of \(X_1,\dots,X_p\) that has maximal variance among all linear combinations that are uncorrelated with \(Z_1\)
  • It turns out that constraining \(Z_2\) to be uncorrelated with \(Z_1\) is equivalent to constraining the direction \(\boldsymbol{\varphi}_2\) to be orthogonal to the direction \(\boldsymbol{\varphi}_1\)
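A quick numerical check of this equivalence, again reusing `Phi` and `Z` from the sketches above:

```python
print(Phi[:, 0] @ Phi[:, 1])          # ~ 0: the two directions are orthogonal
print(np.cov(Z, rowvar=False)[0, 1])  # ~ 0: Cov(Z1, Z2) = 0, so the scores are uncorrelated
```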

PCA directions, scores

  • PC loadings (directions):
    • The PC1 direction is the direction along which the data has the largest variance
    • The PCm direction is the direction along which the data has the largest variance, among all directions orthogonal to the first \(m - 1\) PC directions
  • PC scores:
    • The PCm score for the observation \(\mathbf{x}_{i}\) is the position of \(\mathbf{x}_{i}\) along the \(m^{th}\) PC direction

Other interpretations of PCA

  • Best approximation interpretation:
    • PC1 score \(\times\) PC1 direction = the best 1-dimensional approximation to the data in terms of MSE
    • \(\sum_{m=1}^{M}\) PCm score \(\times\) PCm direction = the best \(M\)-dimensional approximation to the data in terms of MSE
  • Eigenvector interpretation:
    • The PCm direction is the \(m^{th}\) eigenvector (normalized to unit length) of the covariance matrix, sorting the eigenvectors by the size of their eigenvalues
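A small sketch of both interpretations, using made-up NumPy data (all names below are illustrative only): the PC directions are taken from the eigendecomposition of the covariance matrix, and the reconstruction built from the first \(M\) scores and directions is the best rank-\(M\) approximation of the centered data.

```python
import numpy as np

# Made-up data: n = 100 observations on p = 4 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
Xc = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))
Xc -= Xc.mean(axis=0)                 # center the columns

# Eigenvector interpretation: PC directions are the eigenvectors of the
# covariance matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Phi = eigvecs[:, np.argsort(eigvals)[::-1]]

# Best-approximation interpretation: sum of (PC-m score) x (PC-m direction)
# over m = 1, ..., M is the best rank-M approximation of the centered data.
M = 2
Z = Xc @ Phi[:, :M]                   # first M score vectors
X_hat = Z @ Phi[:, :M].T              # rank-M reconstruction
print(np.mean((Xc - X_hat) ** 2))     # small MSE: two PCs capture most of the variation
```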

Illustration

  • USArrests data: for each of the fifty states in the United States, the data set contains the number of arrests per \(100,000\) residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas)
  • The principal component score vectors have length \(n = 50\), and the principal component loading vectors have length \(p = 4\)
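A hedged sketch of this analysis in Python, assuming statsmodels (which can fetch R's USArrests data from the Rdatasets repository, so it needs network access) and scikit-learn:

```python
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fetch R's USArrests data (Murder, Assault, UrbanPop, Rape for the 50 states).
usarrests = sm.datasets.get_rdataset("USArrests").data

# Standardize the four variables, then compute all four principal components.
X = StandardScaler().fit_transform(usarrests)
pca = PCA(n_components=4)
scores = pca.fit_transform(X)  # 50 rows of PC scores (one per state)
print(pca.components_)         # four loading vectors, each of length p = 4
```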


Variable scaling

  • If the variables are in different units, scaling each to have standard deviation equal to one is recommended
  • If they are in the same units, you might or might not scale the variables
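As a small illustration of why scaling matters here (reusing the `usarrests` data frame fetched above), skipping the scaling step lets the variable with the largest raw variance dominate the first component:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

raw = PCA().fit(usarrests)
scaled = PCA().fit(StandardScaler().fit_transform(usarrests))
print(raw.explained_variance_ratio_)     # PC1 dominated by Assault (largest raw variance)
print(scaled.explained_variance_ratio_)  # variance spread more evenly after scaling
```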

Proportion of Variance

  • To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one
  • The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as \[\sum_{j=1}^p \text{Var}(X_j) = \sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2\] and the variance explained by the \(m^{th}\) principal component is \[\text{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^n z_{im}^2\]

Proportion of Variance

The PVE of the \(m^{th}\) principal component is given by the positive quantity between 0 and 1 \[\dfrac{\sum_{i=1}^n z_{im}^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2}\]
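A short numerical check of this formula, reusing the centered matrix `Xc` and loading matrix `Phi` from the earlier sketches (scikit-learn reports the same quantity as `explained_variance_ratio_`):

```python
import numpy as np

# Xc: centered n x p data matrix; Phi: p x p matrix of loading vectors (as above).
Z = Xc @ Phi                                   # all PC scores
pve = (Z ** 2).sum(axis=0) / (Xc ** 2).sum()   # PVE of each component
print(pve, pve.sum())                          # the PVEs sum to 1
```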

Principal Components Regression (PCR)

Principal Components Regression

  • We have seen methods of reducing model variance by:
    • using a less flexible model (with fewer parameters)
    • selecting a subset of predictors
    • regularization / shrinkage
  • Another approach: transform the predictors to a lower-dimensional space using PCA
  • Combining PCA with linear regression leads to principal components regression (PCR)

Principal Components Regression

PCR = PCA + linear regression:

  1. Choose how many PCs to use, say, M.

  2. Use PCA to define a new feature vector \(\mathbf{z}_{i}\) containing the PC1, …, PCM scores for \(\mathbf{x}_{i}\)

  3. Use least-squares linear regression with this model:

    \[Y_{i} = \mathbf{z}_{i}^\intercal \boldsymbol{\beta} + \varepsilon_{i}\]

PCR works well when the directions in which the original predictors vary most are the directions that are predictive of the outcome.
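A minimal scikit-learn sketch of these three steps; the data below are made up purely so the example runs, and \(M\) is fixed at 2 for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Made-up data purely so the sketch runs: 100 observations, 10 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

M = 2  # step 1: choose the number of principal components

# Steps 2 and 3: build the PC score features z_i, then run least squares on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:5]))  # fitted values for the first five observations
```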

PCR versus other methods

PCR versus least-squares:

  • When \(M = p\), PCR = least-squares
  • PCR has higher bias but lower variance
  • PCR can handle \(p > n\)

PCR does not select a subset of predictors/features, and therefore it is more closely related to Ridge than Lasso.

The PCR dimensionality \(M\) can be chosen using cross-validation, as sketched below.
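One way to do this, sketched with `GridSearchCV` over the pipeline from the PCR sketch above (`X`, `y`, and `pcr` are reused from it):

```python
from sklearn.model_selection import GridSearchCV

# Try every possible number of components and keep the best 5-fold CV score.
grid = {"pca__n_components": list(range(1, X.shape[1] + 1))}
search = GridSearchCV(pcr, grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the chosen M
```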

Question time